Chapter 8: Linear regression

Author

Colin Foster

Welcome to the online content for Chapter 8!

As always, I’ll assume that you’ve already read up to this chapter of the book and worked through the online content for the previous chapters. If not, please do that first.

As always, click the ‘Run Code’ buttons below to execute the R code. Remember to wait until they say ‘Run Code’ before you press them. And be careful to run the boxes in order, because later boxes can depend on things you have done in earlier ones.

Fitting regression lines

This chapter revolved around a single data set of 12 people’s heights at different locations. Let’s begin by reading in that data set.
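The interactive code box is not reproduced here, so here is a sketch that builds a stand-in data frame by hand. The variable names (location, height) and all of the values are illustrative assumptions, not the book’s actual data.

```r
# Stand-in for the chapter's data set: 12 people's heights (cm)
# at different locations (km). These values are illustrative, NOT the real data.
heights_data <- data.frame(
  location = c(2, 5, 8, 12, 15, 20, 24, 28, 33, 37, 41, 45),
  height   = c(167, 171, 166, 173, 170, 175, 174, 172, 179, 180, 178, 186)
)
heights_data
```

If the real data set came as a file, a call like read.csv on that file would replace the hand-built data frame.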

Let’s use plot to see what the data look like:
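A sketch of the plot call, using an illustrative stand-in data frame (the values and variable names are assumptions, not the book’s data):

```r
# Illustrative stand-in data (NOT the book's real values)
heights_data <- data.frame(
  location = c(2, 5, 8, 12, 15, 20, 24, 28, 33, 37, 41, 45),
  height   = c(167, 171, 166, 173, 170, 175, 174, 172, 179, 180, 178, 186)
)
# Scatterplot of height against location, with labelled axes
plot(height ~ location, data = heights_data,
     xlab = "Location (km)", ylab = "Height (cm)")
```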

This matches the plot that we saw in the chapter.

Now let’s fit a least-squares regression line. To do this, we use the lm function, which stands for ‘linear model’. We supply a ‘formula’ using the tilde symbol ~, just as we did for ANOVAs in the previous chapter.
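A sketch of the lm call, again on illustrative stand-in data (so the printed estimates will not match the chapter’s 166.2 and 0.469):

```r
# Illustrative stand-in data (NOT the book's real values)
heights_data <- data.frame(
  location = c(2, 5, 8, 12, 15, 20, 24, 28, 33, 37, 41, 45),
  height   = c(167, 171, 166, 173, 170, 175, 174, 172, 179, 180, 178, 186)
)
# 'height ~ location' reads as 'model height as a function of location'
height_model <- lm(height ~ location, data = heights_data)
height_model  # prints the intercept and slope estimates
```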

The output that we get shows us the parameter estimates: the intercept (166.2 cm) and the slope (0.469 cm/km). These match the values in the chapter.

If we want more detail, we can put summary(…) around the lm function, the same way as we did with the aov function in the previous chapter.
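Sketched on the same kind of illustrative stand-in data:

```r
# Illustrative stand-in data (NOT the book's real values)
heights_data <- data.frame(
  location = c(2, 5, 8, 12, 15, 20, 24, 28, 33, 37, 41, 45),
  height   = c(167, 171, 166, 173, 170, 175, 174, 172, 179, 180, 178, 186)
)
height_model <- lm(height ~ location, data = heights_data)
# Full output: residual summary, coefficients with standard errors,
# t and p values, R-squared, and the model-vs-error F test
summary(height_model)
```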

We get a lot more detail now.

Under ‘Coefficients:’ we get the same intercept and slope estimates (in the first, ‘Estimate’ column). Each of these comes with its standard error; the estimate divided by its standard error gives the \(t\) value in the next column, and the ‘Pr(>|t|)’ column gives the \(p\) value.

The intercept has a ridiculously small \(p\) value. The number 2e-16 is written in standard form, because it’s so small. This is a useful notation for very large or very small numbers. The notation 2e-16 means \(2 \times10^{-16}\), which is equal to 0.000,000,000,000,000,2. You can see why it’s easier to write in standard form, because writing out all of those zeroes is tedious and error-prone! It’s essentially zero. The \(p\) value for the intercept is so small because the null hypothesis being tested is that the population height intercept is 0 cm, which is a ridiculous null hypothesis. So, we usually ignore the intercept \(p\) value.
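You can check in R that the e-notation is just standard form in disguise:

```r
# 2e-16 is R's way of writing 2 x 10^(-16)
2e-16
# The same number written out in full -- standard form spares us the zeroes
format(2e-16, scientific = FALSE)
```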

The \(p\) value that we care about is the one for the slope, because the slope being zero isn’t a ridiculous null hypothesis. The slope being zero corresponds to no relationship between location and height, and that’s exactly the thing that we want to test. The location \(p\) value is .009, which is significant at the 5% level, as explained in the chapter.
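If you want the slope’s \(p\) value as a number, rather than reading it off the printout, it can be pulled out of the coefficient table. This sketch uses illustrative stand-in data, so the value will not be the chapter’s .009:

```r
# Illustrative stand-in data (NOT the book's real values)
heights_data <- data.frame(
  location = c(2, 5, 8, 12, 15, 20, 24, 28, 33, 37, 41, 45),
  height   = c(167, 171, 166, 173, 170, 175, 174, 172, 179, 180, 178, 186)
)
height_model <- lm(height ~ location, data = heights_data)
# coef(summary(...)) returns the coefficients table as a matrix;
# row 'location', column 'Pr(>|t|)' is the slope's p value
coef_table <- coef(summary(height_model))
coef_table["location", "Pr(>|t|)"]
```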

The ‘Residuals:’ part of the output tells us about the sizes of the residuals. The largest residual is 3.4469 cm. This corresponds to the point that the model most underestimates, because 3.4469 has to be added to the model’s prediction to get the true data value. Similarly, -2.4802 corresponds to the point that the model most overestimates, because 2.4802 has to be subtracted from the line’s prediction to get the true data value. The 1Q, Median and 3Q values give the quartiles of the residuals.
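The residuals can be inspected directly with resid. This sketch (on illustrative stand-in data) also checks the defining relationship: each observed height is the model’s prediction plus its residual.

```r
# Illustrative stand-in data (NOT the book's real values)
heights_data <- data.frame(
  location = c(2, 5, 8, 12, 15, 20, 24, 28, 33, 37, 41, 45),
  height   = c(167, 171, 166, 173, 170, 175, 174, 172, 179, 180, 178, 186)
)
height_model <- lm(height ~ location, data = heights_data)
res <- resid(height_model)
# The most underestimated and most overestimated points
max(res)
min(res)
# Observed value = fitted value + residual, for every point
all(abs(heights_data$height - (fitted(height_model) + res)) < 1e-10)
```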

Finally, the numbers at the bottom of the output tell us the values for an ANOVA comparing the model (the line) to the error (the residuals). The \(p\) value is exactly the same as the location \(p\) value, because a significant slope is equivalent to a significant model in simple linear regression (with just one predictor).
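You can see that ANOVA directly with the anova function, and check why the two \(p\) values agree: with a single predictor, the \(F\) statistic is exactly the square of the slope’s \(t\) value. Sketched on illustrative stand-in data:

```r
# Illustrative stand-in data (NOT the book's real values)
heights_data <- data.frame(
  location = c(2, 5, 8, 12, 15, 20, 24, 28, 33, 37, 41, 45),
  height   = c(167, 171, 166, 173, 170, 175, 174, 172, 179, 180, 178, 186)
)
height_model <- lm(height ~ location, data = heights_data)
anova(height_model)  # same F and p value as the bottom of summary()
# With one predictor, F is exactly t squared
t_slope <- coef(summary(height_model))["location", "t value"]
anova(height_model)[["F value"]][1]
t_slope^2
```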

The ‘Multiple R-squared’ is just the \(R^2\) value mentioned in the chapter, which tells us that the model (i.e. location) accounts for about 51% of the variance in the heights of the people. Adjusted \(R^2\) is a more conservative estimate of the model’s predictive power that takes account of the fact that we have only 12 data points. If we had a very large number of data points, \(R^2\) and Adjusted \(R^2\) would be very similar.
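Both versions of \(R^2\) can be extracted from the summary object. Sketched on illustrative stand-in data, so the values will not be the chapter’s 51%:

```r
# Illustrative stand-in data (NOT the book's real values)
heights_data <- data.frame(
  location = c(2, 5, 8, 12, 15, 20, 24, 28, 33, 37, 41, 45),
  height   = c(167, 171, 166, 173, 170, 175, 174, 172, 179, 180, 178, 186)
)
height_model <- lm(height ~ location, data = heights_data)
model_summary <- summary(height_model)
model_summary$r.squared      # Multiple R-squared
model_summary$adj.r.squared  # Adjusted R-squared: a little smaller,
                             # especially with only a few data points
```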